
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Henderson, Peter

Neural Information Processing Systems

Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and have failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material.






Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding

Gartner, Mike

arXiv.org Artificial Intelligence

U.S. health insurance is complex, and inadequate understanding and limited access to justice have dire implications for the most vulnerable. Advances in natural language processing present an opportunity to support efficient, case-specific understanding and to improve access to justice and healthcare. Yet existing corpora lack the context necessary for assessing even simple cases. We collect and release a corpus of reputable legal and medical text related to U.S. health insurance. We also introduce an outcome prediction task for health insurance appeals, designed to support regulatory and patient self-help applications, and we release a labeled benchmark for the task along with models trained on it.


Utah's High-Stakes PR Campaign to Wrest Control of Public Lands

Mother Jones

Utah Attorney General Sean Reyes speaks at the Utah State Capitol in Salt Lake City last year after state leaders announced they are suing the federal government over 18.5 million acres of Bureau of Land Management land, which covers about 34% of Utah. Saige Miller / KUER via High Country News

This story was originally published by High Country News and is reproduced here as part of the Climate Desk collaboration. Last year, as Utah prepared to file a federal lawsuit aiming to take control of millions of acres of federal public land within its borders, state officials sought help swaying public opinion in their favor. So they turned to a group of public relations professionals at Penna Powers, a media and branding firm based in Salt Lake City. Backed by a commitment of more than $2 million in taxpayer funds, the firm sprang into action. One of its early orders of business was studying the opposition. In June 2024, an assistant attorney general sent an email to numerous state government colleagues and Penna Powers staffers containing a video from the Theodore Roosevelt Conservation Partnership (TRCP) in which the well-known hunter and media personality Randy Newberg described the dangers of transferring federal land to state control.


Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Langlais, Pierre-Carl, Hinostroza, Carlos Rosas, Nee, Mattia, Arnett, Catherine, Chizhov, Pavel, Jones, Eliot Krzystof, Girard, Irène, Mach, David, Stasenko, Anastasia, Yamshchikov, Ivan P.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the use of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with data-security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset covers a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets, and it includes a large portion of code data. The diversity of its sources in terms of domains and time periods opens up paths for both research and entrepreneurial needs in diverse areas of knowledge. In this technical report, we present the detailed provenance of the data assembly and the details of dataset filtering and curation. Already used by industry leaders such as Anthropic and by multiple LLM training projects, Common Corpus, we believe, will become critical infrastructure for open science research in LLMs.


The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

Bommarito, Michael J II, Bommarito, Jillian, Katz, Daniel Martin

arXiv.org Artificial Intelligence

Practically all large language models have been pre-trained on data that is subject to global uncertainty related to copyright infringement and breach of contract. This creates potential risk for users and developers due to this uncertain legal status. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks related to copyright or breach of contract. The foundation of this project is a corpus of over 132 million documents and trillions of tokens spanning 16 different sources that have been verified to meet the strict copyright and licensing protocol detailed herein. We are releasing the entire pipeline, including 1) the source code to acquire and process these documents, 2) the original document formats with associated provenance and metadata, 3) extracted content in a standardized format, 4) pre-tokenized representations of the documents, and 5) various mid- and post-train resources such as question-answer, summarization, conversion, drafting, classification, prediction, and conversational data. All of these resources are freely available to the public on S3, Hugging Face, and GitHub under CC-BY terms. We are committed to continuing this project in furtherance of a more ethical, legal, and sustainable approach to the development and use of AI models.


China's DeepSeek impresses. But is a 'fast follow' good enough in AI?

Christian Science Monitor | Science

American stock markets shuddered on Monday, prompted by China's announcement that it has created a capable, cheap artificial intelligence model. It's the biggest cloud yet to darken the West's blue-sky enthusiasm over AI, calling into question the efficacy of America's export controls and the billions of dollars the United States is pouring into the technology's expensive cutting edge. Chinese startup DeepSeek says its AI assistant uses less advanced chips than its rivals' models do, and it costs less to train. Unlike the West's billions, the Chinese model was developed for just $5.6 million, by one estimate. "Are we going to spend $500 billion to get to the frontier so that China can find a way to copy our homework for pennies on the dollar?"


Towards Best Practices for Open Datasets for LLM Training

Baack, Stefan, Biderman, Stella, Odrozek, Kasia, Skowron, Aviya, Bdeir, Ayah, Bommarito, Jillian, Ding, Jennifer, Gahntz, Maximilian, Keller, Paul, Langlais, Pierre-Carl, Lindahl, Greg, Majstorovic, Sebastian, Marda, Nik, Penedo, Guilherme, Van Segbroeck, Maarten, Wang, Jennifer, von Werra, Leandro, Baker, Mitchell, Belião, Julie, Chmielinski, Kasia, Fadaee, Marzieh, Gutermuth, Lisa, Kydlíček, Hynek, Leppert, Greg, Lewis-Jong, EM, Larsen, Solana, Longpre, Shayne, Lungati, Angela Oduor, Miller, Cullen, Miller, Victor, Ryabinin, Max, Siminyu, Kathleen, Strait, Andrew, Surman, Mark, Tumadóttir, Anna, Weber, Maurice, Weiss, Rebecca, White, Lee, Wolf, Thomas

arXiv.org Artificial Intelligence

Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in places like the EU and Japan, it is allowed under certain restrictions, while in the United States the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend, among both corporate and public interest actors, toward minimizing the information shared about training datasets. This trend harms the broader ecosystem by hindering transparency, accountability, and innovation, and by denying researchers, auditors, and impacted individuals the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing there are no such models trained at a meaningful scale, due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building toward a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.